
Learning from Irreproducibility: Introducing Data Leakage Case Studies for Machine Learning Education

Fund, Fraida; Saeed, Mohamed; Malik, Shaivi; Ishak, Kyrillos (July 2025, ACM)

Data leakage remains a pervasive issue in machine learning (ML), especially when applied to science, leading to overly optimistic performance estimates and irreproducible findings. Despite its prevalence, data leakage receives limited attention in ML education, in part due to the lack of accessible, hands-on teaching resources. To address this gap, we developed interactive learning modules in which students reproduce examples from academic publications that are affected by data leakage, then repeat the evaluation without the data leakage error to see how the finding is affected. These modules were deployed by the authors in two introductory machine learning courses, enabling students to explore common forms of leakage and their impact on model reliability. Student feedback collected after engagement with these materials highlighted increased awareness of subtle pitfalls that can compromise machine learning workflows.

Free, publicly-accessible full text available July 29, 2026
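
To make the "evaluate with the leak, then without it" workflow from the abstract above concrete, here is a minimal sketch of one common form of leakage: selecting features on the full dataset before the train/test split. It is not taken from the authors' modules. The dataset is synthetic noise, and every name in it (make_noise_dataset, select_feature, score_on_test, run_experiment) is hypothetical and chosen only for illustration. Because the labels are coin flips, an honest evaluation should land near chance, while the leaky one tends to look better than it should.

```python
import random


def make_noise_dataset(n_samples, n_features, rng):
    """Pure-noise features and coin-flip labels: no feature is truly predictive."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [rng.randint(0, 1) for _ in range(n_samples)]
    return X, y


def class_means(X, y, rows, j):
    """Mean of feature j for class 0 and class 1, using only the given rows."""
    zeros = [X[i][j] for i in rows if y[i] == 0]
    ones = [X[i][j] for i in rows if y[i] == 1]
    mean0 = sum(zeros) / len(zeros) if zeros else 0.0
    mean1 = sum(ones) / len(ones) if ones else 0.0
    return mean0, mean1


def select_feature(X, y, rows):
    """Pick the feature whose class means differ most on the given rows."""
    def gap(j):
        mean0, mean1 = class_means(X, y, rows, j)
        return abs(mean1 - mean0)
    return max(range(len(X[0])), key=gap)


def score_on_test(X, y, train, test, j):
    """Fit a one-feature threshold rule on train rows, score it on test rows."""
    mean0, mean1 = class_means(X, y, train, j)
    threshold = (mean0 + mean1) / 2
    hits = 0
    for i in test:
        above = X[i][j] > threshold
        predicted = int(above) if mean1 >= mean0 else int(not above)
        hits += predicted == y[i]
    return hits / len(test)


def run_experiment(n_trials=100, n_samples=60, n_features=100, n_train=40, seed=0):
    """Average test accuracy over many noise datasets, with and without leakage."""
    rng = random.Random(seed)
    train = list(range(n_train))
    test = list(range(n_train, n_samples))
    everything = train + test
    leaky_total = clean_total = 0.0
    for _ in range(n_trials):
        X, y = make_noise_dataset(n_samples, n_features, rng)
        leaky_j = select_feature(X, y, everything)  # leak: test rows influence selection
        clean_j = select_feature(X, y, train)       # correct: selection sees train rows only
        leaky_total += score_on_test(X, y, train, test, leaky_j)
        clean_total += score_on_test(X, y, train, test, clean_j)
    return leaky_total / n_trials, clean_total / n_trials


leaky_acc, clean_acc = run_experiment()
print(f"leaky evaluation: {leaky_acc:.3f}   clean evaluation: {clean_acc:.3f}")
```

The only difference between the two evaluations is which rows the selection step is allowed to see. Any gap between the printed accuracies is therefore attributable to the leak, not to the model.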

Quantization is often cited as a technique for reducing model size and accelerating deep learning. However, past literature suggests that the effect of quantization on latency varies significantly across different settings, in some cases even increasing inference time rather than reducing it. To address this discrepancy, we conduct a series of systematic experiments on the Chameleon testbed to investigate the impact of three key variables on the effect of post-training quantization: the machine learning framework, the compute hardware, and the model itself. Our experiments demonstrate that each of these has a substantial impact on the overall inference time of a quantized model. Furthermore, we make experiment materials and artifacts publicly available so that others can validate our findings on the same hardware using Chameleon, and we share open educational resources on this topic that may be adopted in formal and informal education settings.
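
As a rough illustration of what post-training quantization does and how its latency effect might be measured, the sketch below quantizes a weight vector to the int8 range and times a dot product before and after. It assumes symmetric, per-tensor, weight-only quantization, which the abstract does not specify, and it runs in plain Python rather than in any of the frameworks or on the Chameleon hardware studied. The helper names (quantize_int8, dequantize, median_latency) are hypothetical. The timings it prints say nothing about real accelerators; the point is only the shape of the measurement: warm up, repeat, and compare medians instead of assuming a speed-up.

```python
import random
import statistics
import time


def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by scale * q, q in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map integer codes back to approximate float weights."""
    return [scale * v for v in q]


def median_latency(fn, warmup=3, repeats=15):
    """Median wall-clock time of fn() in seconds, after a few warm-up calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


rng = random.Random(0)
weights = [rng.uniform(-1.0, 1.0) for _ in range(10_000)]
inputs = [rng.uniform(-1.0, 1.0) for _ in range(10_000)]
q_weights, scale = quantize_int8(weights)


def float_dot():
    return sum(w * x for w, x in zip(weights, inputs))


def quantized_dot():
    # Weight-only quantization: integer weights, float inputs, one rescale at the end.
    return scale * sum(q * x for q, x in zip(q_weights, inputs))


# Rounding to the nearest code bounds the per-weight error by half a quantization step.
max_error = max(abs(w - d) for w, d in zip(weights, dequantize(q_weights, scale)))
float_latency = median_latency(float_dot)
quant_latency = median_latency(quantized_dot)

print(f"max weight error: {max_error:.5f} (half step: {scale / 2:.5f})")
print(f"float median: {float_latency * 1e3:.3f} ms   quantized median: {quant_latency * 1e3:.3f} ms")
```

Whether the quantized path comes out faster or slower here depends on the interpreter and machine, which is in keeping with the abstract's observation that the latency effect of quantization varies with framework, hardware, and model.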
